grpc: Add noncebalancer that tracks non-READY backends#8672

Closed
beautifulentropy wants to merge 1 commit into main from fix-noncebalancer

@beautifulentropy
Member

@beautifulentropy commented Mar 11, 2026

The nonce service's maxConnectionAge (30s) periodically sends a GOAWAY to the WFE's gRPC connections, causing affected SubConns to briefly leave READY state while reconnecting. Due to jitter on maxConnectionAge, the getNonceService and redeemNonceService connections to the same backend can GOAWAY at slightly different times, creating a window where the WFE can still issue nonces from a backend it can no longer redeem against.

The original nonce balancer/picker (grpc/noncebalancer) only tracks READY SubConns. So when a backend is reconnecting after a GOAWAY it is indistinguishable from a backend that does not exist; this results in a badNonce error for the subscriber. The v2 balancer fixes this by maintaining two maps: one for READY backends and one for not-READY backends. When a request targets a prefix whose backend exists but isn't READY, the picker returns ErrNoSubConnAvailable, which tells gRPC to queue the RPC and wait for the SubConn to reconnect (see picker_wrapper.go:159). Only genuinely unknown prefixes now produce ErrNoBackendsMatchPrefix.
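
The picker decision described above can be sketched roughly as follows. This is a simplified, standard-library-only model, not the code in this PR: the `picker` type, its `pick` method, the lowercase error variables, and the prefixes and addresses are all hypothetical stand-ins, and backends are represented by address strings rather than real SubConns.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical stand-ins for the two errors named in the description. In the
// real balancer, ErrNoSubConnAvailable comes from gRPC's balancer package.
var (
	errNoSubConnAvailable    = errors.New("no SubConn available; queue the RPC")
	errNoBackendsMatchPrefix = errors.New("no backends match the nonce prefix")
)

// picker is a simplified model that maps nonce prefixes to backend addresses
// instead of to real SubConns.
type picker struct {
	ready    map[string]string // prefix -> address of a READY backend
	notReady map[string]string // prefix -> address of a known, not-READY backend
}

// pick returns the backend address for a prefix, or an error explaining why
// no backend can be used right now.
func (p picker) pick(prefix string) (string, error) {
	if addr, ok := p.ready[prefix]; ok {
		return addr, nil
	}
	if _, ok := p.notReady[prefix]; ok {
		// The backend exists but is reconnecting: ask the caller to wait.
		return "", errNoSubConnAvailable
	}
	// No known backend owns this prefix at all.
	return "", errNoBackendsMatchPrefix
}

func main() {
	p := picker{
		ready:    map[string]string{"abcd": "10.0.0.1:9101"},
		notReady: map[string]string{"efgh": "10.0.0.2:9101"},
	}
	for _, prefix := range []string{"abcd", "efgh", "zzzz"} {
		addr, err := p.pick(prefix)
		fmt.Printf("%s: addr=%q err=%v\n", prefix, addr, err)
	}
}
```

The point of the second map is that "known but reconnecting" and "unknown" become distinguishable, so only the latter needs to surface as an error to the subscriber.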

To simplify comparison during review and testing in staging, the v2 balancer is implemented as a separate package (grpc/noncebalancerv2) alongside the existing grpc/noncebalancer. Either can be configured in the WFE by setting redeemNonceService.srvResolver to "nonce-srv" or "nonce-srv-v2" in the WFE config.

Note: grpc/noncebalancerv2/balancer.go is best compared directly against vendor/google.golang.org/grpc/balancer/base/balancer.go

Fixes #8662

@beautifulentropy changed the title from "grpc: Add noncebalancer that tracks reconnecting backends" to "grpc: Add noncebalancer that tracks non-READY backends" Mar 11, 2026
@beautifulentropy marked this pull request as ready for review March 11, 2026 21:33
@beautifulentropy requested a review from a team as a code owner March 11, 2026 21:33
@beautifulentropy requested a review from jsha March 11, 2026 21:33
@github-actions
Contributor

@beautifulentropy, this PR appears to contain configuration and/or SQL schema changes. Please ensure that a corresponding deployment ticket has been filed with the new values.

@aarongable
Contributor

Closing in favor of #8679

@aarongable closed this Mar 17, 2026
jsha added a commit that referenced this pull request Mar 19, 2026
The old noncebalancer only saw READY SubConns, which was a problem
during the brief periods when a SubConn was reconnecting (for instance
due to a GOAWAY from the server), since nonce redemption requests are
not fungible between backends. Unfortunately, READY SubConns are all
that the balancer interface provides. And we can't get that interface to
pass non-READY SubConns to our picker without reimplementing or copying
all its SubConn management logic.

Luckily, grpc provides the [`endpointsharding`] balancer implementation
that does exactly what we want. It maintains a collection of child
balancers each owning a single endpoint (note: for our setup an endpoint
is equivalent to a single address, though it _can_ be one-to-many). It
also lets us query the [state] of each child, including the endpoint
it's responsible for.

This allows us to construct a picker that is aware of all available
backends, even those that aren't currently READY. That, in turn,
prevents us from temporarily serving errors while a given nonce
redemption backend is reconnecting.

To see another example of `endpointsharding` in use, see the
[`customroundrobin`] implementation.

For more context on how `endpointsharding` came to be implemented, see
[gRFC A61: IPv4 and IPv6 Dualstack Backend Support][a61].

If you're curious _how_ `endpointsharding` passes around the information
about non-READY SubConns, it [uses a type assertion] from a
`balancer.Picker` to its internal type.
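
As a rough illustration of that type-assertion pattern, here is a
standard-library-only sketch. It is not the `endpointsharding` code: the
`Picker` interface, `childState`, `shardingPicker`, `plainPicker`, and
`childStatesFromPicker` are all hypothetical, simplified stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

// Picker is a hypothetical stand-in for gRPC's balancer.Picker interface.
type Picker interface {
	Pick(prefix string) (string, error)
}

// childState is a hypothetical, simplified per-endpoint state record.
type childState struct {
	Address string
	Ready   bool
}

// shardingPicker plays the role of the internal type: an ordinary Picker
// that also carries the state of every child, READY or not.
type shardingPicker struct {
	children []childState
}

func (p *shardingPicker) Pick(string) (string, error) {
	return "", errors.New("Pick is not used in this sketch")
}

// plainPicker is some other Picker that carries no child states.
type plainPicker struct{}

func (plainPicker) Pick(string) (string, error) { return "", nil }

// childStatesFromPicker recovers the child states by type-asserting the
// interface value back to the internal type. It returns nil for any other
// Picker implementation.
func childStatesFromPicker(p Picker) []childState {
	sp, ok := p.(*shardingPicker)
	if !ok {
		return nil
	}
	return sp.children
}

func main() {
	var p Picker = &shardingPicker{children: []childState{
		{Address: "10.0.0.1:9101", Ready: true},
		{Address: "10.0.0.2:9101", Ready: false},
	}}
	for _, c := range childStatesFromPicker(p) {
		fmt.Printf("%s ready=%t\n", c.Address, c.Ready)
	}
	fmt.Println(len(childStatesFromPicker(plainPicker{})))
}
```

Because the extra state rides along inside the picker value, a parent
balancer can learn about non-READY children without any change to the
`Picker` interface itself.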

Alternative to #8672. Fixes #8662.

This edits `noncebalancer.go` in place for ease of diffing, and also
copies the original `grpc/noncebalancer` (with no edits) to
`grpc/noncebalancerv1`. But don't take my word for it:

```bash
diff <(git show origin/main:grpc/noncebalancer/noncebalancer.go) grpc/noncebalancerv1/noncebalancer.go
diff <(git show origin/main:grpc/noncebalancer/noncebalancer_test.go) grpc/noncebalancerv1/noncebalancer_test.go
```

[`endpointsharding`]:
https://pkg.go.dev/google.golang.org/grpc/balancer/endpointsharding
[state]:
https://pkg.go.dev/google.golang.org/grpc/balancer/endpointsharding#ChildState
[a61]:
https://github.com/grpc/proposal/blob/master/A61-IPv4-IPv6-dualstack-backends.md
[`customroundrobin`]:
https://github.com/grpc/grpc-go/blob/99f36d4a0c28bc967a8d3fe23ebc2a264b322070/examples/features/customloadbalancer/client/customroundrobin/customroundrobin.go
[uses a type assertion]:
https://github.com/grpc/grpc-go/blob/99f36d4a0c28bc967a8d3fe23ebc2a264b322070/balancer/endpointsharding/endpointsharding.go#L324

Development

Successfully merging this pull request may close these issues.

Fix badNonce CI flake
